{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c29851a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import dask.dataframe as dd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37cf1f6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from fairlearn.metrics import (\n",
    "    MetricFrame,\n",
    "    true_positive_rate,\n",
    "    false_negative_rate,\n",
    "    false_positive_rate,\n",
    "    count\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9422c741",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e302f11",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "import logging\n",
    "\n",
    "logger = logging.getLogger()\n",
    "logger.setLevel(logging.CRITICAL)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "682924f6",
   "metadata": {},
   "source": [
    "# Sample Notebook - Face Validation "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcf6b1ec",
   "metadata": {},
   "source": [
    "This Jupyter notebook walks you through an example of assessing a face validation system for any potential fairness-related disparities. You can either use the provided sample CSV file `face_verify_sample_rand_data.csv` or use your own dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b31ad634",
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "from raiutils.dataset import fetch_dataset\n",
    "outdirname = 'responsibleai.12.28.21'\n",
    "zipfilename = outdirname + '.zip'\n",
    "\n",
    "fetch_dataset('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n",
    "\n",
    "with zipfile.ZipFile(zipfilename, 'r') as unzip:\n",
    "    unzip.extractall('.')\n",
    "results_csv = \"face_verify_sample_rand_data.csv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca3ce668",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(results_csv, index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a343ec5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d7e1250",
   "metadata": {},
   "source": [
    "Our fairness assessment can be broken down into three tasks:\n",
    "\n",
    "1. Idenfity harms and which groups may be harmed.\n",
    "\n",
    "2. Define fairness metrics to quantify harms\n",
    "\n",
    "3. Compare our quantified harms across the relevant groups."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5585c06",
   "metadata": {},
   "source": [
    "## 1.) Identify which groups may be harmed and how"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79492a3e",
   "metadata": {},
   "source": [
    "The first step of our fairness assessment is understanding which groups are more likely to be *adversely affected* by our face verification system."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7edd10f4",
   "metadata": {},
   "source": [
    "The work of Joy Buolamwini and Timnit Gebru on *Gender Shades* ([Buolamwini and Gebru, 2018](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf)) showed a performance disparity in the accuracy of commercially available facial recognition systems between darker-skinned women and lighter-skinned men. One key takeaway from this work is the importance of intersectionality when conducting a fairness assessment. For this fairness assessment, we will explore performance disparities disaggregated by `race` and `gender`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd393f5f",
   "metadata": {},
   "source": [
    "Using the terminology recommended by the [Fairlearn User Guide](https://fairlearn.org/v0.7.0/user_guide/fairness_in_machine_learning.html#fairness-of-ai-systems), we are interested in mitigating **quality-of-service harms**. **Quality-of-service** harms concern whether a system achieves the same level of performance for one person as it does for others, even when no opportunities or resources are withheld. The *face validation* system produces this harm if it fails to validate faces for members of one demographic group at a higher rate than for members of other demographic groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f540e82",
   "metadata": {},
   "outputs": [],
   "source": [
    "sensitive_features = [\"race\", \"gender\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b7e22a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(sensitive_features)[\"golden_label\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bf3b022",
   "metadata": {},
   "source": [
    "The `matching_score` represents the probability that the two images show the same face, according to the vision model. We say two faces *match* if the `matching_score` is greater than or equal to a specific threshold, `0.5` by default. Based on your needs, you can increase or decrease this threshold to any value between `0.0` and `1.0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96c772c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "threshold = 0.5\n",
    "df.loc[:, \"matching_score_binary\"] = df[\"matching_score\"] >= threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04f3c6be",
   "metadata": {},
   "source": [
    "## 2.) Define fairness metrics to quantify harms"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caa0ff75",
   "metadata": {},
   "source": [
    "The second step of our fairness assessment is to translate our fairness-related harms into quantifiable metrics. With face validation, there are two harms we should consider:\n",
    "\n",
    "1. *False positives* occur when two different faces are considered by the system to be matching. A *false positive* can be extremely dangerous in many cases, such as security authentication. We would not want people to unlock someone else's phone due to a Face ID false positive.\n",
    "\n",
    "2. *False negatives* occur when two pictures of the same person are not considered to be a match by the system. A *false negative* may result in an individual being locked out of their account because facial verification fails. However, in many cases, *false negatives* are not nearly as harmful as *false positives*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b50183be",
   "metadata": {},
   "source": [
    "To assess fairness-related disparities using the `MetricFrame`, we must first specify our *sensitive features* `A` along with our `fairness_metrics`. In this scenario, we will look at three different *fairness metrics*:\n",
    "- `count`: The number of data points in each demographic category.\n",
    "- `FNR`: The false negative rate for the group.\n",
    "- `FPR`: The false positive rate for the group."
   ]
  },
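  {
   "cell_type": "markdown",
   "id": "f1a7c0de",
   "metadata": {},
   "source": [
    "As a reminder of what these rates measure, the standard confusion-matrix definitions are sketched below. The symbols $TP$, $FP$, $TN$, and $FN$ (counts of true positives, false positives, true negatives, and false negatives within a group) are introduced here for illustration and do not appear as columns in the dataset.\n",
    "\n",
    "$$\n",
    "\\mathrm{FNR} = \\frac{FN}{FN + TP}, \\qquad \\mathrm{FPR} = \\frac{FP}{FP + TN}\n",
    "$$\n",
    "\n",
    "In this setting, a high FNR for a group means genuine matches from that group are rejected more often, while a high FPR means different faces are accepted as the same person more often."
   ]
  },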
  {
   "cell_type": "markdown",
   "id": "794f4306",
   "metadata": {},
   "source": [
    "With our system, we want to keep the *false positive rate* as low as possible while also avoiding large disparities in the *false negative rate* across groups. For our example, we will look at the system's performance disaggregated by `race` and `gender`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f293c83",
   "metadata": {},
   "outputs": [],
   "source": [
    "A, Y = df.loc[:, sensitive_features], df.loc[:, \"golden_label\"]\n",
    "Y_pred = df.loc[:, \"matching_score_binary\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c6ac1e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "fairness_metrics = {\n",
    "    \"count\": count,\n",
    "    \"FNR\": false_negative_rate,\n",
    "    \"FPR\": false_positive_rate\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad0c47db",
   "metadata": {},
   "source": [
    "## 3.) Compare quantified harms across different groups"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cecc32ff",
   "metadata": {},
   "source": [
    "In the final step of our fairness assessment, we instantiate our `MetricFrame` by defining the following parameters:\n",
    "\n",
    "- *metrics*: The metrics of interest for our fairness assessment.\n",
    "- *y_true*: The ground truth labels for the ML task.\n",
    "- *y_pred*: The model's predicted labels for the ML task.\n",
    "- *sensitive_features*: The set of feature(s) for our fairness assessment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e5023e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "metricframe = MetricFrame(\n",
    "    metrics=fairness_metrics,\n",
    "    y_true=Y,\n",
    "    y_pred=Y_pred,\n",
    "    sensitive_features=A\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76f25bd0",
   "metadata": {},
   "source": [
    "With our `MetricFrame`, we can use the `by_group` property to view our `fairness_metrics` disaggregated by our different demographic groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe25786d",
   "metadata": {},
   "outputs": [],
   "source": [
    "metricframe.by_group"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a191d6b",
   "metadata": {},
   "source": [
    "With the `difference` method, we can view the maximal disparity in each metric. We see there is a maximal `false negative rate difference` between `Black female` and `White male` of `0.0177`."
   ]
  },
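  {
   "cell_type": "markdown",
   "id": "b2d84e19",
   "metadata": {},
   "source": [
    "To make the reported number concrete, here is a sketch of what `difference` computes, assuming its default `method` setting (a comparison between groups). The notation $m_g$, for the value of a metric on group $g$, and $G$, for the set of all `race` and `gender` combinations, is introduced here for illustration.\n",
    "\n",
    "$$\n",
    "\\mathrm{difference}(m) = \\max_{g \\in G} m_g - \\min_{g \\in G} m_g\n",
    "$$\n",
    "\n",
    "Under this reading, the value reported for `FNR` is the gap between the groups with the highest and lowest false negative rates in the `by_group` table."
   ]
  },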
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caf5b439",
   "metadata": {},
   "outputs": [],
   "source": [
    "metricframe.difference()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b4a405a",
   "metadata": {},
   "source": [
    "### Applying Different Thresholds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23361d07",
   "metadata": {},
   "source": [
    "In the previous section, we used a *threshold* of `0.5` to determine the minimum `matching_score` needed for a successful match. In practice, we could choose any *threshold* between 0.0 and 1.0 to get a *false negative rate* and *false positive rate* that is acceptable for the specific task.\n",
    "\n",
    "Now, we're going to explore how changing the threshold affects the resultant *false positive rate* and *false negative rate*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98bda06a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def update_dictionary_helper(dictionary, results):\n",
    "    for (k, v) in results.items():\n",
    "        dictionary[k].append(v)\n",
    "    return dictionary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "678962c6",
   "metadata": {},
   "source": [
    "The following function iterates through a set of potential thresholds and computes the resultant model predictions at each threshold. The function then creates a `MetricFrame` to compute the disaggregated metrics at this threshold level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ccbd06b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_group_thresholds_dask(dataframe, metric, A, bins=10):\n",
    "    # Evenly spaced thresholds in (0, 1]; 0 is skipped because every pair would match.\n",
    "    thresholds = np.linspace(0, 1, bins + 1)[1:]\n",
    "    full_dict = defaultdict(list)\n",
    "    for threshold in thresholds:\n",
    "        # Binarize the scores at this threshold.\n",
    "        Y_pred_threshold = dataframe.loc[:, \"matching_score\"] >= threshold\n",
    "        metricframe_threshold = MetricFrame(\n",
    "            metrics={metric.__name__: metric},\n",
    "            y_true=dataframe.loc[:, \"golden_label\"],\n",
    "            y_pred=Y_pred_threshold,\n",
    "            sensitive_features=A\n",
    "        )\n",
    "        # Append each group's metric value to that group's list.\n",
    "        results = metricframe_threshold.by_group[metric.__name__].to_dict()\n",
    "        full_dict = update_dictionary_helper(full_dict, results)\n",
    "    return full_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "186be386",
   "metadata": {},
   "source": [
    "Using the `plot_thresholds` function, we can visualize the `false_positive_rate` and `false_negative_rate` for the data at each *threshold* level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5881e7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_thresholds(thresholds, thresholds_dict, metric):\n",
    "    plt.figure(figsize=[12, 8])\n",
    "    for (k, vals) in thresholds_dict.items():\n",
    "        plt.plot(thresholds, vals, label=f\"{k}\")\n",
    "        plt.scatter(thresholds, vals, s=20)\n",
    "    plt.xlabel(\"Threshold\")\n",
    "    plt.xticks(thresholds)\n",
    "    plt.ylabel(f\"{metric.__name__}\")\n",
    "    plt.legend(bbox_to_anchor=(1, 1), loc=\"upper left\")\n",
    "    # Pass the on/off flag positionally: the `b` keyword was removed in newer Matplotlib releases.\n",
    "    plt.grid(True, which=\"both\", axis=\"both\", color='gray', linestyle='dashdot', linewidth=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "834f0446",
   "metadata": {},
   "outputs": [],
   "source": [
    "thresholds = np.linspace(0, 1, 11)[1:]\n",
    "fn_thresholds_dict = compute_group_thresholds_dask(df, false_negative_rate, A)\n",
    "fp_thresholds_dict = compute_group_thresholds_dask(df, false_positive_rate, A)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a2895ed",
   "metadata": {},
   "source": [
    "From the visualization, we see the *false_negative_rate* for all groups increases as the threshold increases. Furthermore, the maximal `false_negative_rate_difference` occurs between *White Female* and *Black Male* when the `threshold` is set to `0.7`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8977b81",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_thresholds(thresholds, fn_thresholds_dict, false_negative_rate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "879970fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_thresholds(thresholds, fp_thresholds_dict, false_positive_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32f1da24",
   "metadata": {},
   "source": [
    "If it were essential to keep the *false_positive_rate* at 0 for all groups, then according to the plots above, we simply need to choose a *threshold* greater than or equal to `0.5`. However, increasing the *threshold* above `0.5` in our data also increases the **absolute false negative rate** across all groups as well as the *relative false negative rate difference* between groups."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab15fa20",
   "metadata": {},
   "source": [
    "### Comparison to Synthetic Disparity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b07e294b",
   "metadata": {},
   "source": [
    "In our dataset, there isn't a substantial disparity in the `false_negative_rate` between the different demographic groups. In this section, we will introduce a synthetic `race_synth` group to illustrate what the results would look like if a disparity were present. We generate the `race_synth` rows such that `gender` is assigned at random and the `matching_score` depends entirely on the `golden_label`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7401cf5",
   "metadata": {},
   "source": [
    "If the synthetic `golden_label` is `0`, then the `matching_score` is drawn from `Uniform(0, 0.5)`. If the synthetic `golden_label` is `1`, then the `matching_score` is drawn from `Uniform(0, 1)`. The `create_disparity` function below generates the additional rows using this process."
   ]
  },
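  {
   "cell_type": "markdown",
   "id": "c3e95f2a",
   "metadata": {},
   "source": [
    "As a rough sanity check on what to expect from the synthetic group, the two distributions above imply the following expected error rates at a threshold $t$ with $0 \\le t \\le 1$. The symbols $s$ (the synthetic `matching_score`) and $y$ (the synthetic `golden_label`) are notation introduced here, and the results are expectations, so the sampled values will vary slightly.\n",
    "\n",
    "$$\n",
    "\\mathrm{FNR}_{\\mathrm{synth}}(t) = P(s < t \\mid y = 1) = t, \\qquad \\mathrm{FPR}_{\\mathrm{synth}}(t) = P(s \\ge t \\mid y = 0) = \\max(0,\\ 1 - 2t)\n",
    "$$\n",
    "\n",
    "At $t = 0.5$ this gives an expected false negative rate of about $0.5$ and a false positive rate of $0$ for the `race_synth` rows, which is the kind of gap the synthetic group is meant to produce."
   ]
  },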
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ccef68d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_disparity(dataframe, num_rows=2000):\n",
    "    n = dataframe.shape[0]\n",
    "    synth_ground_truth = np.random.randint(low=0, high=2, size=num_rows)\n",
    "    synth_gender = np.random.choice([\"Male\", \"Female\"], size=num_rows)\n",
    "    # Label 0: score in [0, 0.5); label 1: score in [0, 1).\n",
    "    synth_match_score = np.random.random(size=num_rows) / (2.0 - synth_ground_truth)\n",
    "\n",
    "    new_indices = range(n, n+num_rows)\n",
    "    src_imgs, dst_imgs = [f\"Source_Img_{i}\" for i in new_indices], [f\"Target_Img_{i}\" for i in new_indices]\n",
    "    synth_rows = pd.DataFrame.from_dict({\n",
    "        \"source_image\": src_imgs,\n",
    "        \"target_image\": dst_imgs,\n",
    "        \"race\": [\"race_synth\" for i in new_indices],\n",
    "        \"gender\": synth_gender,\n",
    "        \"golden_label\": synth_ground_truth,\n",
    "        \"matching_score\": synth_match_score\n",
    "    })\n",
    "    return synth_rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb1283e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "disp = create_disparity(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44496362",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_df = pd.concat([df, disp], axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "603b8c0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_df.loc[:, \"matching_score_binary\"] = synth_df[\"matching_score\"] >= threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ebdcbb2",
   "metadata": {},
   "source": [
    "Now we create another `MetricFrame` with the same parameters as the one above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31a455d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_metricframe = MetricFrame(\n",
    "    metrics=fairness_metrics,\n",
    "    y_true=synth_df.loc[:, \"golden_label\"],\n",
    "    y_pred=synth_df.loc[:, \"matching_score_binary\"],\n",
    "    sensitive_features=synth_df.loc[:, sensitive_features]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39165178",
   "metadata": {},
   "source": [
    "Now when we inspect `by_group` on this new `MetricFrame`, we can easily see the large disparity between the `race_synth` groups and the other racial groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03581522",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_metricframe.by_group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d59dea89",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_metricframe.difference()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5432753",
   "metadata": {},
   "outputs": [],
   "source": [
    "synth_metricframe.by_group.plot(kind=\"bar\", y=\"FNR\", figsize=[12,8], title=\"FNR by Race and Gender\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0462d1b",
   "metadata": {},
   "source": [
    "## Fairness Assessment Dashboard"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e836ddb",
   "metadata": {},
   "source": [
    "With the `raiwidgets` library, we can use the `FairnessDashboard` to visualize the disparities between our different `race` and `gender` demographics. We pass in our *sensitive_features*, *golden_labels*, and *thresholded matching scores* to the dashboard. We can view the **dashboard** either within this Jupyter notebook or at a separate **localhost** URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92ef06a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from raiwidgets import (\n",
    "    FairnessDashboard\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d916eb9a",
   "metadata": {},
   "source": [
    "We instantiate the `FairnessDashboard` by passing in three parameters:\n",
    "- `sensitive_features`: The set of sensitive features.\n",
    "- `y_true`: The ground truth labels.\n",
    "- `y_pred`: The model's predicted labels.\n",
    "\n",
    "The `FairnessDashboard` can be accessed either within the Jupyter notebook or by going to the *localhost URL*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52813001",
   "metadata": {},
   "outputs": [],
   "source": [
    "FairnessDashboard(\n",
    "    sensitive_features=synth_df.loc[:, sensitive_features],\n",
    "    y_true=synth_df.loc[:, \"golden_label\"],\n",
    "    y_pred=synth_df.loc[:,\"matching_score_binary\"]\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
